AITopics | softmax bottleneck

softmax bottleneck

Sigsoftmax: Reanalysis of the Softmax Bottleneck

Neural Information Processing Systems, Mar-16-2026, 23:28:30 GMT

Softmax is an output activation function for modeling categorical probability distributions in many applications of deep learning. However, a recent study revealed that softmax can be a bottleneck of representational capacity of neural networks in language modeling (the softmax bottleneck). In this paper, we propose an output activation function for breaking the softmax bottleneck without additional parameters. We re-analyze the softmax bottleneck from the perspective of the output set of log-softmax and identify the cause of the softmax bottleneck. On the basis of this analysis, we propose sigsoftmax, which is composed of a multiplication of an exponential function and sigmoid function. Sigsoftmax can break the softmax bottleneck. The experiments on language modeling demonstrate that sigsoftmax and mixture of sigsoftmax outperform softmax and mixture of softmax, respectively.
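As a rough illustration of the activation described in this abstract, the sketch below normalizes the product of an exponential and a sigmoid. The function names, the log-domain formulation, and the assumption that normalization is a plain sum over the outputs (as in softmax) are choices made here, not details taken from the listing.

```python
import math


def softmax(z):
    """Standard softmax, shifted by the max for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def log_sigmoid(v):
    """log(sigmoid(v)) computed without overflowing math.exp."""
    if v >= 0:
        return -math.log1p(math.exp(-v))
    return v - math.log1p(math.exp(v))


def sigsoftmax(z):
    """Assumed form: exp(z_i) * sigmoid(z_i), normalized to sum to one.

    exp(z) * sigmoid(z) = exp(z + log sigmoid(z)), so the result equals
    softmax applied to the adjusted scores z + log sigmoid(z).
    """
    return softmax([v + log_sigmoid(v) for v in z])


if __name__ == "__main__":
    print(softmax([1.0, 2.0, 3.0]))
    print(sigsoftmax([1.0, 2.0, 3.0]))
```

Unlike softmax, this function is not invariant to adding a constant to every logit, because the sigmoid factor depends on each logit's absolute value.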

artificial intelligence, machine learning, softmax bottleneck, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Sigsoftmax: Reanalysis of the Softmax Bottleneck

Sekitoshi Kanai, Yasuhiro Fujiwara, Yuki Yamanaka, Shuichi Adachi

Neural Information Processing Systems, Feb-13-2026, 22:02:42 GMT

Neural Information Processing Systems http://nips.cc/

sigsoftmax, softmax, softmax bottleneck, (14 more...)

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Mixtape: Breaking the Softmax Bottleneck Efficiently

Zhilin Yang, Thang Luong, Russ R. Salakhutdinov, Quoc V. Le

Neural Information Processing Systems, Feb-12-2026, 04:17:07 GMT

Neural Information Processing Systems http://nips.cc/

arxiv preprint arxiv, batch size, mixtape, (10 more...)

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada (0.04)

Genre: Research Report (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Sigsoftmax: Reanalysis of the Softmax Bottleneck

Neural Information Processing Systems, Nov-20-2025, 22:43:24 GMT

name change, sigsoftmax, softmax bottleneck, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Sigsoftmax: Reanalysis of the Softmax Bottleneck

Sekitoshi Kanai, Yasuhiro Fujiwara, Yuki Yamanaka, Shuichi Adachi

Neural Information Processing Systems, Nov-20-2025, 18:42:48 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, softmax, (17 more...)

Country:

North America > United States > New York (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(2 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Limitations of Normalization in Attention Mechanism

Mudarisov, Timur, Burtsev, Mikhail, Petrova, Tatiana, State, Radu

arXiv.org Artificial Intelligence, Oct-21-2025

This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
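The dilution effect mentioned in this abstract can be illustrated with a toy calculation. The sketch below is not the paper's framework or its bounds: it assumes one "informative" token whose score exceeds all others by a fixed hypothetical `gap`, and reports the softmax weight that token receives as the context grows.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over raw scores with an optional temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_weight(n_tokens, gap=2.0, temperature=1.0):
    """Weight of one token scoring `gap` above n_tokens - 1 zero-score tokens.

    `gap` is a hypothetical score margin chosen for illustration.
    """
    scores = [gap] + [0.0] * (n_tokens - 1)
    return softmax(scores, temperature)[0]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, top_weight(n), top_weight(n, temperature=0.25))
```

With a bounded score gap, the top weight equals exp(gap / T) / (exp(gap / T) + n - 1), so it shrinks toward zero as n grows. Lowering the temperature T delays the effect but does not remove it.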

machine learning, natural language, softmax, (21 more...)

arXiv:2508.17821

Country: Europe (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.80)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Mixtape: Breaking the Softmax Bottleneck Efficiently

Zhilin Yang, Thang Luong, Russ R. Salakhutdinov, Quoc V. Le

Neural Information Processing Systems, Oct-2-2025, 17:36:04 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (14 more...)

Genre: Research Report (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Reviews: Sigsoftmax: Reanalysis of the Softmax Bottleneck

Neural Information Processing Systems, Oct-7-2024, 21:31:11 GMT

The paper analyzes the ability of softmax, when used as the output activation function of a neural network, to approximate posterior distributions. The problem is translated into a study of the rank of the matrices containing the log-probabilities computed by the analyzed activation layer. It is shown that softmax does not increase the rank of the input response matrix (i.e. [...]). The authors propose replacing softmax with the so-called sigsoftmax (i.e. [...]). It is shown that the rank of the sigsoftmax matrix is not less than the rank of the softmax matrix.
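The rank argument summarized in this review can be checked numerically on a small example. The sketch below is an illustration under assumptions made here, not the paper's experiment: the logits are an outer product of two hypothetical vectors `H` and `W` (hidden size 1), sigsoftmax is taken to be the normalized product of exp and sigmoid, and rank is estimated by Gaussian elimination with a fixed tolerance.

```python
import math


def log_softmax_row(z):
    """Row-wise log-softmax: subtract the log-sum-exp of the row."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


def log_sigmoid(v):
    """log(sigmoid(v)) computed without overflowing math.exp."""
    if v >= 0:
        return -math.log1p(math.exp(-v))
    return v - math.log1p(math.exp(v))


def log_sigsoftmax_row(z):
    """Assumed form: log of normalized exp(z) * sigmoid(z)."""
    return log_softmax_row([v + log_sigmoid(v) for v in z])


def matrix_rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    a = [list(r) for r in rows]
    n_rows, n_cols = len(a), len(a[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, n_cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank


# Hypothetical context and word vectors with hidden size d = 1.
H = [0.5, 1.0, 1.5, 2.0, 2.5]
W = [-1.0, 0.3, 0.9, 1.7, 2.4]
logits = [[h * w for w in W] for h in H]

rank_logits = matrix_rank(logits)
rank_soft = matrix_rank([log_softmax_row(r) for r in logits])
rank_sig = matrix_rank([log_sigsoftmax_row(r) for r in logits])

if __name__ == "__main__":
    print(rank_logits, rank_soft, rank_sig)
```

Log-softmax only subtracts one constant per row, so the rank can grow by at most one over the logit matrix. The elementwise sigmoid term in the assumed sigsoftmax form is nonlinear in the logits, which is what lets its log-output matrix exceed that limit on this example.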

activation function, artificial intelligence, machine learning, (7 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck

Godey, Nathan, de la Clergerie, Éric, Sagot, Benoît

arXiv.org Artificial Intelligence, Apr-11-2024

Recent advances in language modeling consist in pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller counterparts. However, it has been observed that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau. In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution. This mismatch affects the performance of the linear prediction head used in such models through the well-known softmax bottleneck phenomenon. We measure the effect of the softmax bottleneck in various settings and find that models based on less than 1000 hidden dimensions tend to adopt degenerate latent representations in late pretraining, which leads to reduced evaluation performance.
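The mismatch between hidden dimension and target rank described in this abstract can be stated as a short bound. The notation is assumed here and does not come from the abstract: N contexts with hidden states stacked in H (N × d), a vocabulary of size V with output embeddings W (V × d), and A the N × V matrix of log-probabilities produced by a linear head followed by softmax.

```latex
A_{ij} = \log \operatorname{softmax}(H W^\top)_{ij}
       = h_i^\top w_j - c_i,
\qquad
c_i = \log \sum_{k=1}^{V} \exp\!\left(h_i^\top w_k\right)

A = H W^\top - c\,\mathbf{1}^\top
\quad\Longrightarrow\quad
\operatorname{rank}(A) \le \operatorname{rank}(H W^\top) + 1 \le d + 1
```

Under this notation, a target log-probability matrix whose rank exceeds d + 1 cannot be matched exactly, however well the rest of the network is trained.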

computational linguistic, language model, representation, (14 more...)

arXiv:2404.07647

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
South America > Colombia > Meta Department > Villavicencio (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(10 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

HistAlign: Improving Context Dependency in Language Generation by Aligning with History

Wan, David, Zhang, Shiyue, Bansal, Mohit

arXiv.org Artificial Intelligence, Dec-3-2023

Language models (LMs) can generate hallucinations and incoherent outputs, which highlights their weak context dependency. Cache-LMs, which augment LMs with a memory of recent history, can increase context dependency and have shown remarkable performance in diverse language generation tasks. However, we find that even with training, the performance gain stemming from the cache component of current cache-LMs is suboptimal due to the misalignment between the current hidden states and those stored in the memory. In this work, we present HistAlign, a new training approach to ensure good cache alignment such that the model receives useful signals from the history. We first prove our concept on a simple and synthetic task where the memory is essential for correct predictions, and we show that the cache component of HistAlign is better aligned and improves overall performance. Next, we evaluate HistAlign on diverse downstream language generation tasks, including prompt continuation, abstractive summarization, and data-to-text. We demonstrate that HistAlign improves text coherence and faithfulness in open-ended and conditional generation settings respectively. HistAlign is also generalizable across different model families, showcasing its strength in improving context dependency of LMs in diverse scenarios. Our code is publicly available at https://github.com/meetdavidwan/histalign
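For readers unfamiliar with cache-LMs, the sketch below shows one generic way a cache component can be combined with a base model: past hidden states are scored against the current one, and the resulting distribution is interpolated with the model's own prediction. This is a common continuous-cache formulation used here as an assumption. It is not HistAlign's training method, and the names `theta` and `lam` are hypothetical.

```python
import math


def cache_distribution(query, memory, vocab_size, theta=1.0):
    """Distribution over tokens from a memory of (hidden_state, next_token).

    Each stored state is scored by its dot product with the current hidden
    state `query`; the softmax weight is added to the token that followed it.
    """
    scores = [theta * sum(q * k for q, k in zip(query, key)) for key, _ in memory]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    probs = [0.0] * vocab_size
    for w, (_, token) in zip(weights, memory):
        probs[token] += w / total
    return probs


def interpolate(p_lm, p_cache, lam=0.2):
    """Mix the base LM distribution with the cache distribution."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(p_lm, p_cache)]


if __name__ == "__main__":
    memory = [([1.0, 0.0], 2), ([0.0, 1.0], 0)]  # toy history
    p_cache = cache_distribution([1.0, 0.0], memory, vocab_size=3)
    p_lm = [0.5, 0.3, 0.2]
    print(interpolate(p_lm, p_cache))
```

In this formulation the cache helps only if the current hidden state is close to the stored states that preceded the right token, which is the alignment the abstract says is lacking.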

computational linguistic, ist, linguistic, (17 more...)

arXiv:2305.04782

Country:

Europe > Austria (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(12 more...)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.90)
